Abstract

We offer a novel view of AdaBoost in a statistical setting. We propose a Bayesian
model for binary classification in which label noise is modeled hierarchically. Using
variational inference to optimize a dynamic evidence lower bound, we derive
a new boosting-like algorithm called VIBoost. We show its close connections to
AdaBoost and give experimental results from four datasets.

1 Introduction 

Boosting, and in particular AdaBoost [20, 5, 6], is an effective method of aggregating classifiers.
AdaBoost produces a reliable binary classifier and often avoids overfitting. Nevertheless, it can be 
sensitive to "noisy" data and may severely underperform as a result. In this paper, we embed binary 
classification in a Bayesian model and show how it interfaces with the boosting paradigm. With this 
model, we can address the vulnerability to noise in a principled way. 

Real-world data will almost always include noise, even with binary labels. In the U.S. Presidential 
election of 2000, the country was kept in suspense for more than a month while votes were recounted 
in the state of Florida. During the recount it had become apparent that the use of the "butterfly ballot" 
had confused voters [23]. Probabilistically, we can model a confused voter as one who casts a vote
that is independent of his or her actual intention. These votes, born of confusion, are considered
"noisy" and any attempt to learn a voter-to-vote connection, e.g., via boosting, becomes difficult.
However, this does not preclude the extraction of important noise information. If we can detect and 
quantify the noise properties of a given dataset, then it should be reflected in our expectations of 
constructing a good classifier. 

In addressing label noise, we have chosen to interpret aggregating classifiers in a fully-Bayesian 
model. Once in place, this model lets us incorporate additional latent variables to account for noise. 
In our context, noise means that the true label is ignored and randomly reassigned, i.e., it may be 
inverted. A learning algorithm such as AdaBoost is sensitive to this type of label perturbation be- 
cause it focuses on the examples that pose a greater difficulty of classification. Using this augmented 
model, we construct an algorithm that performs approximate inference of the posterior distribution 
associated with the latent variables. Although the intent is inference, the algorithm is able to produce 
a binary classifier accompanied by noise statistics that reflect the quality of the learned classifier. We 
also show that the algorithm — in its simplest form — reduces to a smoothed version of AdaBoost. 

In developing a Bayesian model for aggregating binary classifiers, we begin with the logistic re- 
gression model proposed by Q. Given a set of base classifiers, the latent variables of the model 
are the weights placed on these base classifiers. We then introduce variables to account for label 
perturbations. Finally, we use variational inference to estimate the posterior distributions. 

Our ideas lead to a new boosting-like algorithm called VIBoost — boosting stemming from 
variational inference. AdaBoost employs a greedy search for incorporating new base classifiers. 
Similarly, in VIBoost each main-loop iteration introduces a new base classifier, which induces a 
new model. With this new model, variational inference is applied using previous values for a warm
start. In the process, noise statistics are cultivated. Our experiments reveal that VIBoost performs
on par with AdaBoost and supplies meaningful characterizations of the label perturbations. 

Much has been done to cast boosting in a statistical setting. Friedman et al. [7] leveraged the logistic 
regression model and then used a functional gradient to derive a boosting update. Collins et al. O 
used information geometry to derive AdaBoost and algorithms emerged with exponential and logistic
loss objectives. Lebanon & Lafferty [15] solidified the relationship between AdaBoost and
maximum likelihood via duality. These ideas led to a Bayesian perspective of boosting and provided
a way to incorporate prior knowledge [21].

There have been many approaches for handling noise. For example, Servedio addressed label noise 
in a PAC learning framework and developed SmoothBoost [22]. Through a statistical formulation,
Krause & Singer [14] addressed noise in the context of symmetric, random label inversions, and
devised algorithms to alleviate the resulting adverse effects. In one of these algorithms they used 
expectation maximization to construct a classifier while simultaneously updating a noise parameter. 
Building on this work, we use variational inference and address label noise in the process. 

The paper is organized as follows: the initial groundwork for the Bayesian model is given in §2. In
§3 we introduce two probability distributions that will play a role in the model. The proposed model
is presented in §4 and variational inference is applied in §5. We discuss the connection to AdaBoost
in §6. We give experimental results in §7 and we conclude in §8.



2 The Core Model 

In the binary classification problem we are given a set of $N$ labeled examples $\{(x_n, y_n)\}_{n=1}^{N}$. Each
example is an element of some space $\mathcal{X}$ and the labels are elements of $\{-1, +1\}$. In addition to the
labeled examples, we also have a set of $M$ base classifiers $\mathcal{F} = \{f_1, \ldots, f_M\}$. Each element of $\mathcal{F}$
is a function that maps $\mathcal{X}$ to $\{-1, +1\}$. Additionally, we assume that (a) $h \in \mathcal{F} \Rightarrow -h \notin \mathcal{F}$ and
(b) $h_1, h_2 \in \mathcal{F} \Rightarrow \exists\, i \in \{1, \ldots, N\}$ such that $h_1(x_i) \neq h_2(x_i)$. These assumptions ensure a finite
number of classifiers and prevent identifiability problems.

For a fixed $x \in \mathcal{X}$, suppose the logarithm of the "+1"-to-"-1" label odds is given by $F(x)$. Then
we can form the conditional probability mass function for the labels as
$p(y \mid x, F) = \frac{1}{1 + \exp[-yF(x)]}$. A label sampled in this way shall be called a true label. Logistic
regression models the spatially-variant log-odds-ratio as a weighted sum over all base classifiers,
i.e., $F(x) = \sum_{m=1}^{M} c_m f_m(x)$.

Consider a model defined by the following generative process: 

1. Draw $c_m \sim \mathcal{P}_c$ $(m = 1, \ldots, M)$.

2. Construct $F = \sum_{m=1}^{M} c_m f_m$.

3. Draw $x_{1:N}$ independently according to some distribution over $\mathcal{X}$.

4. Draw $y_n \in \{-1, +1\}$ independently according to $p(y_n \mid x_n, F) = \frac{1}{1 + \exp[-y_n F(x_n)]}$.

The graphical model is shown in Figure 1.
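The generative process above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the decision stumps on the real line, the fixed weights standing in for draws from the weight prior, and the names `true_label_prob`, `make_stump`, and `sample_core_model` are all assumptions introduced here.

```python
import math
import random

def true_label_prob(y, F_x):
    """p(y | x, F) = 1 / (1 + exp(-y * F(x))) for y in {-1, +1}."""
    return 1.0 / (1.0 + math.exp(-y * F_x))

def make_stump(threshold):
    """Hypothetical base classifier on the real line: +1 if x > threshold, else -1."""
    return lambda x: 1 if x > threshold else -1

def sample_core_model(xs, stumps, weights, rng):
    """Steps 2 and 4: build F = sum_m c_m f_m, then draw one label per example."""
    labels = []
    for x in xs:
        F_x = sum(c * f(x) for c, f in zip(weights, stumps))
        p_plus = true_label_prob(+1, F_x)
        labels.append(+1 if rng.random() < p_plus else -1)
    return labels

rng = random.Random(0)
stumps = [make_stump(t) for t in (-1.0, 0.0, 1.0)]
weights = [0.5, 1.0, 0.5]  # assumed values standing in for draws of c_m (step 1)
xs = [rng.uniform(-3.0, 3.0) for _ in range(10)]  # step 3, with an assumed uniform law
ys = sample_core_model(xs, stumps, weights, rng)
print(list(zip([round(x, 2) for x in xs], ys)))
```

Labels drawn this way are "true" labels in the sense of the text: each one depends on its example only through the log-odds $F(x)$.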

The latent variables are the base classifier weights $c_{1:M}$. Using the labeled examples, we seek the
posterior distribution over the weights. A similar approach was posed by Minka [17] with the Bayes
Point Machine [11]. In contrast with our work, the author considered (i) a linear classifier without
the notion of base classifiers, (ii) expectation propagation as opposed to variational inference, and
(iii) a Gaussian prior for the weights.

The posterior distribution over the weights reflects a compromise of the observed data
($\{x_n, y_n\}_{n=1}^{N}$) with our prior beliefs ($\mathcal{P}_c$). It also has the potential of yielding a classifier via
the $M$-dimensional mean or mode, for example. Combining prior beliefs with observed data is
made easier through conjugacy, which is how we propose a distribution for $\mathcal{P}_c$. This is the subject
of the next section.

Figure 1: The core graphical model for the boosting problem. Each label depends on the
example and log-odds-ratio function. The only latent variables are the base classifier weights.

Figure 2: For a given versatile logistic with unit multiplicities, the negative logarithm of one of
the product terms is shown above ($\mu = 1$, $\gamma = 2$). Each curve of this form is tightly lower-bounded
by a piecewise-linear function with one knot at $z = \gamma$. The slopes are 0 and $\beta$.

3 The Versatile Logistic & Binary Logistic Distributions 

We use two conjugate distributions to specify the model described in Figure 1. The first distribution
is used as a prior for the weights $c_{1:M}$, and the second is associated with label generation. For
vectors $\beta, \gamma \in \mathbb{R}^K$ and $\mu \in \mathbb{R}^K_+$, we define the density over the reals

$$p(z) \propto \prod_{k=1}^{K} \left( \frac{1}{1 + \exp[\beta_k (z - \gamma_k)]} \right)^{\mu_k} \qquad (1)$$

to be the Versatile Logistic Distribution, written v-Log($\beta, \gamma, \mu$), with slope vector $\beta$, knot vector
$\gamma$, and multiplicity vector $\mu$. Figure 2 provides the motivation behind this nomenclature. Define
$u = [+1, -1]^T \in \mathbb{R}^2$. A familiar density is v-Log($u, 0, 1$), which is a logistic distribution.
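As a quick numerical illustration of definition (1), the sketch below evaluates the unnormalized v-Log log-density and compares v-Log($u, 0, 1$) with the standard logistic density. The function names are assumptions made for this example; only the formula comes from the text.

```python
import math

def log1pexp(t):
    """Numerically stable log(1 + exp(t))."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def vlog_logpdf_unnorm(z, beta, gamma, mu):
    """Unnormalized log-density of v-Log(beta, gamma, mu), following (1)."""
    return -sum(m * log1pexp(b * (z - g)) for b, g, m in zip(beta, gamma, mu))

def logistic_logpdf(z):
    """Log-density of the standard logistic distribution, e^{-z} / (1 + e^{-z})^2."""
    return -z - 2.0 * log1pexp(-z)

u = [+1.0, -1.0]
diffs = []
for z in (-2.0, 0.0, 0.7, 3.0):
    d = vlog_logpdf_unnorm(z, u, [0.0, 0.0], [1.0, 1.0]) - logistic_logpdf(z)
    diffs.append(d)
    print(z, d)
```

The printed differences are zero up to rounding, so for this particular parameter choice the product in (1) is already normalized.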

The density described in (1) is valid if and only if there exists both a positive and a negative slope with
corresponding positive multiplicity. Consequently, we must have $K \geq 2$. Additionally, this distribution
is unimodal, so it is reasonable to estimate its mean with an approximate mode. We prove
these facts in §S.1. The product represented in (1) relates to a Product of Experts [12]; however,
each factor by itself does not correspond to a valid density.

We now define a probability mass function for the binary random variable $Y$ taking values in
$\{-1, +1\}$. For scalars $z$, $\beta$, and $\gamma$ we define

$$p(y) = \frac{1}{1 + \exp[-y\beta(z - \gamma)]} \qquad (2)$$

to be the corresponding Binary Logistic Distribution, written b-Log($z, \beta, \gamma$). In comparing (2) to
label generation in our model, we see that $\beta$ and $\gamma$ encode base classifier information. The versatile
logistic and binary logistic are conjugate in the following way: if $z \sim$ v-Log($\beta, \gamma, \mu$) and $y_n \mid z \sim$
b-Log($z, \theta_n, \phi_n$), drawn independently for $n = 1, \ldots, N$, then the posterior of $z$ given $y_{1:N}$ is
also a versatile logistic with parameters

$$\beta' = [\beta_1, \ldots, \beta_K, -y_1\theta_1, \ldots, -y_N\theta_N]^T \in \mathbb{R}^{K+N} \qquad (3)$$

$$\gamma' = [\gamma_1, \ldots, \gamma_K, \phi_1, \ldots, \phi_N]^T \in \mathbb{R}^{K+N} \qquad (4)$$

$$\mu' = [\mu_1, \ldots, \mu_K, 1, \ldots, 1]^T \in \mathbb{R}^{K+N}. \qquad (5)$$

In the binary classification problem, the b-Log-v-Log conjugacy relationship helps with posterior
inference. By construction, the posterior distribution of the weights is a versatile logistic.
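The conjugacy statement can be checked numerically: the unnormalized log-prior plus the label log-likelihoods should equal the unnormalized log-density of the v-Log with the appended parameters. The sketch below assumes the binary logistic parameters are named `thetas` and `phis` and uses made-up numbers; the helper names are hypothetical.

```python
import math

def log1pexp(t):
    """Numerically stable log(1 + exp(t))."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def vlog_logpdf_unnorm(z, beta, gamma, mu):
    """Unnormalized log-density of v-Log(beta, gamma, mu)."""
    return -sum(m * log1pexp(b * (z - g)) for b, g, m in zip(beta, gamma, mu))

def blog_logpmf(y, z, theta, phi):
    """log p(y) for the binary logistic with slope theta and knot phi."""
    return -log1pexp(-y * theta * (z - phi))

def posterior_params(beta, gamma, mu, ys, thetas, phis):
    """Append one (slope, knot, multiplicity) triple per observed label."""
    beta_post = list(beta) + [-y * t for y, t in zip(ys, thetas)]
    gamma_post = list(gamma) + list(phis)
    mu_post = list(mu) + [1.0] * len(ys)
    return beta_post, gamma_post, mu_post

# Assumed toy values, chosen only for illustration.
beta, gamma, mu = [1.0, -1.0], [0.0, 0.0], [2.0, 2.0]
ys, thetas, phis = [+1, -1, +1], [1.0, 1.0, 1.0], [0.3, -0.2, 1.0]
post = posterior_params(beta, gamma, mu, ys, thetas, phis)

gaps = []
for z in (-1.5, 0.0, 0.4, 2.0):
    lhs = vlog_logpdf_unnorm(z, beta, gamma, mu)
    lhs += sum(blog_logpmf(y, z, t, p) for y, t, p in zip(ys, thetas, phis))
    gaps.append(lhs - vlog_logpdf_unnorm(z, *post))
print(gaps)
```

Each observed label contributes exactly one extra factor with unit multiplicity, which is why the parameter vectors grow linearly with the number of examples.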

4 Incorporating Noise 

We will now build upon the model presented in §2. Suppose we fix an instance $x \in \mathcal{X}$ and repeatedly
generate labels from $p(y \mid x, F)$. If the labels are true, then the empirical ratio of plus-to-minus
labels will converge to $\exp[F(x)]$, i.e., the odds ratio.

On the other hand, according to our model, if the labels are noisy, the empirical odds ratio converges
to some fixed value, which is independent of $x$. Let $\exp[\xi]$ be this noise-related odds ratio. Equivalently,
$\xi$ is an instance-independent, static log-odds-ratio, which we refer to as the noise grade. For example,
noise grades of $-\infty$, $0$, and $+\infty$ translate to random label assignments of +1 with probability
0, 1/2, and 1, respectively.

Let $w$ take on values in $\{0, 1\}$ and encode whether a label is true ($w = 1$) or noisy ($w = 0$). We can
merge the two label types into the following conditional label probability:

$$p(y \mid w, x, \xi, F) = \frac{1}{1 + \exp[-y(wF(x) + (1 - w)\xi)]} = \frac{w}{1 + \exp[-yF(x)]} + \frac{1 - w}{1 + \exp[-y\xi]}. \qquad (6)$$

The variable $w$ selects the label type: true or noisy. Treating $w$ as a latent variable, we embellish the
model of §2:

1. Draw $c_m \sim$ v-Log($u, 0, \mu_0 \mathbf{1}$) $(m = 1, \ldots, M)$.

2. Construct $F = \sum_{m=1}^{M} c_m f_m$.

3. Draw $x_{1:N}$ independently according to some distribution over $\mathcal{X}$.

4. Draw $\xi \sim$ v-Log($u, 0, \mu'_0 \mathbf{1}$).

5. Draw $\theta \sim$ Beta($\zeta_1, \zeta_2$) $(\zeta \in \mathbb{R}^2_+)$.

6. Draw $w_n \mid \theta \sim$ Bernoulli($\theta$) $(n = 1, \ldots, N)$.

7. Draw $y_n \in \{-1, +1\}$ independently according to

$$p(y_n \mid w_n, x_n, \xi, F) = \frac{1}{1 + \exp[-y_n(w_n F(x_n) + (1 - w_n)\xi)]}. \qquad (7)$$

There is now a prior assigned to $c_m$, and the new steps (4-6) model noise. The graphical model is
depicted in Figure 3. Although not immediately apparent, this model subsumes label inversion as a
form of noise (§S.2).
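Steps 5 to 7 of the noisy generative process can be sketched as follows. This is an illustration under stated assumptions: the log-odds values `F_values` and the noise grade `xi` are supplied directly rather than drawn from their priors, and the function names are hypothetical.

```python
import math
import random

def label_prob(y, w, F_x, xi):
    """Label probability (7): w = 1 uses the true log-odds F(x), w = 0 the noise grade xi."""
    return 1.0 / (1.0 + math.exp(-y * (w * F_x + (1 - w) * xi)))

def sample_noisy_labels(F_values, zeta, xi, rng):
    """Draw theta ~ Beta(zeta), then (w_n, y_n) for every supplied log-odds value."""
    theta = rng.betavariate(zeta[0], zeta[1])
    pairs = []
    for F_x in F_values:
        w = 1 if rng.random() < theta else 0
        y = +1 if rng.random() < label_prob(+1, w, F_x, xi) else -1
        pairs.append((w, y))
    return theta, pairs

rng = random.Random(1)
F_values = [-2.0, -0.5, 0.5, 2.0]   # assumed log-odds at four examples
theta, pairs = sample_noisy_labels(F_values, zeta=(1.0, 1.0), xi=math.log(3.0), rng=rng)
print(theta, pairs)
```

With a noise grade of $\log 3$, a noisy label is +1 with probability 3/4 regardless of the example, which is what `label_prob(+1, 0, F_x, math.log(3.0))` returns for any `F_x`.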



From the classification standpoint, the primary latent variables of the above generative process are
still $c_{1:M}$, or the weights. The latent $w_{1:N}$, or type selectors, are responsible for the type of label
generated. They are drawn independently from $\theta$, the type prior. We can reason that $\theta/(1 - \theta)$
represents a signal-to-noise ratio (SNR). This stems from the expected value of $N\theta$ true labels and
$N(1 - \theta)$ noisy labels (see §S.2 for full details). Alternatively, we can use the prior of $\theta$ for the SNR
estimate, yielding $E\{\theta\}/E\{1 - \theta\} = \zeta_1/\zeta_2$.

5 Variational Inference

With a Bayesian model in place, our focus turns to the posterior distribution of the latent variables.
This allows us to construct a classifier by estimating the mean or mode of the posterior weights ($c_m$).
We accomplish this with stagewise variational inference.¹

Full Variational Inference. Before motivating our stagewise approach, we first review variational
inference. In our model we have the observed variables ($\mathcal{F}$, $x_{1:N}$, $y_{1:N}$) and the latent variables
($c_{1:M}$, $\xi$, $w_{1:N}$, $\theta$). We are interested in the posterior $p(c_{1:M}, \xi, w_{1:N}, \theta \mid \mathcal{F}, x_{1:N}, y_{1:N})$, which is
proportional to the joint $p(c_{1:M}, \xi, w_{1:N}, \theta, \mathcal{F}, x_{1:N}, y_{1:N})$. In variational inference, we introduce
a distribution $q(c_{1:M}, \xi, w_{1:N}, \theta)$ to bound the log of the marginal probability of the observations:

$$\begin{aligned}
\log p(\mathcal{F}, x_{1:N}, y_{1:N}) \geq{} & \int q(c_{1:M}, \xi, w_{1:N}, \theta) \log p(c_{1:M}, \xi, w_{1:N}, \theta, \mathcal{F}, x_{1:N}, y_{1:N}) \, dc_{1:M} \, d\xi \, dw_{1:N} \, d\theta \\
& - \int q(c_{1:M}, \xi, w_{1:N}, \theta) \log q(c_{1:M}, \xi, w_{1:N}, \theta) \, dc_{1:M} \, d\xi \, dw_{1:N} \, d\theta. \qquad (8)
\end{aligned}$$

The right-hand side of (8) is referred to as the evidence lower bound (ELBO). Using the KL
divergence, we can also write

$$\log p(\mathcal{F}, x_{1:N}, y_{1:N}) = \mathrm{KL}\big( q(c_{1:M}, \xi, w_{1:N}, \theta) \,\|\, \text{posterior} \big) + \mathrm{ELBO}. \qquad (9)$$

The KL divergence provides a measure of closeness between the auxiliary distribution and the posterior.
We maximize the ELBO with respect to the parameters of $q$, thereby minimizing the KL
divergence to the posterior. We use mean-field variational inference, i.e., we assume a factorized $q$:

$$q(c_{1:M}, \xi, w_{1:N}, \theta) = \prod_{m=1}^{M} q(c_m) \cdot q(\xi) \cdot \prod_{n=1}^{N} q(w_n) \cdot q(\theta). \qquad (10)$$

Each component of the factorized variational distribution has a form and variational parameters.
For example, a reasonable form of $q(c_m)$ is a versatile logistic with variational parameters given by
some slope, knot, and multiplicity vectors. Typically, we optimize the parameters with coordinate ascent,
updating each in turn, holding the others fixed. In our model, this yields the following updates [1]:

$$\log q^*(c_m) \leftarrow E_q[\log p(C_{1:m-1}, c_m, C_{m+1:M}, \Xi, W_{1:N}, \Theta, \mathcal{F}, x_{1:N}, y_{1:N})] + \text{const} \qquad (11)$$

$$\log q^*(\xi) \leftarrow E_q[\log p(C_{1:M}, \xi, W_{1:N}, \Theta, \mathcal{F}, x_{1:N}, y_{1:N})] + \text{const} \qquad (12)$$

$$\log q^*(w_n) \leftarrow E_q[\log p(C_{1:M}, \Xi, W_{1:n-1}, w_n, W_{n+1:N}, \Theta, \mathcal{F}, x_{1:N}, y_{1:N})] + \text{const} \qquad (13)$$

$$\log q^*(\theta) \leftarrow E_q[\log p(C_{1:M}, \Xi, W_{1:N}, \theta, \mathcal{F}, x_{1:N}, y_{1:N})] + \text{const}. \qquad (14)$$

Each term on the right is a leave-one-out expectation over the latent variables, resulting in a function
of the corresponding left-out latent variable. The variational inference algorithm cycles through
these updates repeatedly.

This algorithm is not convenient. The chosen form of the approximate posterior weight distributions 
is a versatile logistic. From conjugacy, the number of parameters required to specify each distri- 
bution is linear in the number of examples (N). Additionally, we hope to use a large number of 
base classifiers, even for small datasets. Thus, for our classification problem, cycling through all 
auxiliary weight distributions is impractical because integrating over the weights is too much of a 
computational burden. 

Stagewise Variational Inference. To address these issues, we propose a dynamic model over the
current static one: with a current estimate of $F = \sum_m c_m f_m$, we introduce a single base classifier
and then run variational inference on the latent variables ($c$, $\xi$, $w_{1:N}$, $\theta$). The regression counterpart
would be Forward Stagewise Regression, a greedy algorithm which finds a sparse subset of covariates and is
structurally similar to AdaBoost [10].

In each main loop iteration, let $H(\cdot)$ be the current estimate of the true log-odds-ratio $F(\cdot)$ and
suppose we have a "promising" candidate $h \in \mathcal{F}$ that we wish to merge with $H$. This promising
classifier is found greedily (details below) and, once found, becomes a fixed variable in the model.
Now, rather than $M$ latent weights, we have a single latent weight $c$ corresponding to $h$. Every
update of $H$ induces a new model to which we apply variational inference. This new, time-varying
graphical model is featured in Figure 4.

¹The common approach in this situation is to use a Gibbs sampler. The Gibbs sampler for the graphical
model of Figure 3 is given in §S.3. We used Adaptive Rejection Sampling [9] to sample a v-Log. In practice,
this approach was too time consuming, which is why we turned to variational inference.

Algorithm 1 VIBoost

Input: $\{(x_n, y_n)\}_{n=1}^{N}$, $\mathcal{F}$, $\mu_0 \in \mathbb{R}_+$, $\mu'_0 \in \mathbb{R}_+$, $\zeta \in \mathbb{R}^2_+$
Initialize $H : \mathcal{X} \to \mathbb{R}$ to the zero function
Initialize $\eta \in \mathbb{R}^2_+$, $\omega \in \mathbb{R}^2_+$, and $\phi \in [0, 1]^N$
Define $\beta(h) = [+1, -1, -y_1 h(x_1), \ldots, -y_N h(x_N)]^T$
Define $\gamma(H, h) = [0, 0, -H(x_1)h(x_1), \ldots, -H(x_N)h(x_N)]^T$
Define $\mu(\phi) = [\mu_0, \mu_0, \phi_1, \ldots, \phi_N]^T$
for $t = 1$ to $T$ do
    $h_t \leftarrow \arg\max_{h \in \mathcal{F}} |\mathrm{Mode}[\text{v-Log}(\beta(h), \gamma(H, h), \mu(\phi))]|$
    while ELBO increases significantly do
        $\alpha_t \leftarrow \mathrm{Mode}[\text{v-Log}(\beta(h_t), \gamma(H, h_t), \mu(\phi))]$
        $\omega_1 \leftarrow \mu'_0 + \sum_{n=1}^{N} (1 - \phi_n) \mathbb{1}\{y_n = -1\}$
        $\omega_2 \leftarrow \mu'_0 + \sum_{n=1}^{N} (1 - \phi_n) \mathbb{1}\{y_n = +1\}$
        $\kappa_n \leftarrow \dfrac{\exp[\psi(\eta_1) - \psi(\eta_2) + \psi(\omega_1 + \omega_2) - \psi(\omega_2)\mathbb{1}\{y_n = +1\} - \psi(\omega_1)\mathbb{1}\{y_n = -1\}]}{1 + \exp[-y_n(H(x_n) + \alpha_t h_t(x_n))]}$
        $\phi_n \leftarrow \kappa_n / (1 + \kappa_n)$
        $\eta_1 \leftarrow \zeta_1 + \sum_{n=1}^{N} \phi_n$
        $\eta_2 \leftarrow \zeta_2 + \sum_{n=1}^{N} (1 - \phi_n)$
    end while
    $H \leftarrow H + \alpha_t h_t$
end for
Output: classifier $\mathrm{sign}\{H(\cdot)\}$

Let $\mathcal{D}$ denote the evidence, i.e., the observed variables $x_{1:N}$, $y_{1:N}$, $H$, and $h$. At each stage we
assume the following distributions:

$$p(c \mid \mathcal{D}) \approx q(c \mid \beta, \gamma, \mu) \sim \text{v-Log}(\beta, \gamma, \mu), \qquad p(\xi \mid \mathcal{D}) \approx q(\xi \mid \omega) \sim \text{v-Log}(u, 0, \omega), \qquad (15)$$

$$p(w_n \mid \mathcal{D}) \approx q(w_n \mid \phi_n) \sim \text{Bernoulli}(\phi_n), \qquad p(\theta \mid \mathcal{D}) \approx q(\theta \mid \eta) \sim \text{Beta}(\eta). \qquad (16)$$

The variational updates and the ELBO are derived in §S.4 and §S.5, respectively. The general
approach is to isolate the terms of the log-likelihood that feature the variable of interest; all other
terms will cancel after normalization and are extraneous. We then take expectations and attempt to
identify a distribution.

The resulting algorithm, VIBoost, is presented in Algorithm 1 ($\psi(\cdot)$ is the digamma function). Possible
modifications include (a) fixing the number of variational inference iterations so that ELBO
calculations are avoided, and (b) setting the $\alpha_t$ once and skipping its update in the variational inference
loop.
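To make the inner loop of Algorithm 1 concrete, here is a minimal sketch of the noise-related updates with both modifications applied: the number of passes is fixed, and the margins $H(x_n) + \alpha_t h_t(x_n)$ are held constant. Several pieces are assumptions of this sketch rather than details confirmed by the paper: the series-based `digamma` helper, the second Beta parameter update (taken from standard Beta-Bernoulli conjugacy), the name `mu0_noise` for the noise-grade prior multiplicity, and the toy margins.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift the argument upward, then apply the asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return acc + math.log(x) - 0.5 / x - series

def vi_pass(ys, margins, phi, eta, zeta, mu0_noise):
    """One pass over the omega, phi, and eta updates; margins[n] = H(x_n) + alpha_t h_t(x_n)."""
    omega1 = mu0_noise + sum(1.0 - p for p, y in zip(phi, ys) if y == -1)
    omega2 = mu0_noise + sum(1.0 - p for p, y in zip(phi, ys) if y == +1)
    base = digamma(eta[0]) - digamma(eta[1]) + digamma(omega1 + omega2)
    new_phi = []
    for y, m in zip(ys, margins):
        noise_term = digamma(omega2) if y == +1 else digamma(omega1)
        kappa = math.exp(base - noise_term) / (1.0 + math.exp(-y * m))
        new_phi.append(kappa / (1.0 + kappa))
    # Assumed Beta-Bernoulli update for both components of eta.
    new_eta = (zeta[0] + sum(new_phi), zeta[1] + sum(1.0 - p for p in new_phi))
    return (omega1, omega2), new_phi, new_eta

ys = [+1, +1, -1, -1, +1]
margins = [2.0, 1.5, -2.0, -1.0, -3.0]  # the last example is badly misclassified
phi, eta = [1.0] * 5, (1.0, 1.0)
for _ in range(10):  # fixed number of passes instead of an ELBO check
    omega, phi, eta = vi_pass(ys, margins, phi, eta, zeta=(1.0, 1.0), mu0_noise=1.0)

print("phi:", [round(p, 3) for p in phi])
print("SNR estimate eta1/eta2:", eta[0] / eta[1])
print("noise grade log(omega2/omega1):", math.log(omega[1] / omega[0]))
```

Among examples sharing a label, the one with the smallest margin receives the smallest $\phi_n$, i.e., it is the one judged most likely to be noisy.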

We greedily select the next base classifier by finding the v-Log posterior with maximal mode. A 
large mode suggests that the corresponding weight possesses discriminative classification strength. 
We opted for the mode rather than the mean; we now justify this choice. 

A v-Log distribution with more than two slope/knot/multiplicity terms has the advantage of being
a one-dimensional density, but is cumbersome when evaluating statistics of interest. Computing
the normalization constant, mode, and mean requires iterative techniques, which can bog down any
algorithm. However, if we replace $\mu_k \log(1 + e^{\beta_k(z - \gamma_k)})$, a summand of the negative log-density, with the
single-tail approximation $\mu_k e^{\tau\beta_k(z - \gamma_k)}$ $(\tau > 0)$ we arrive at the modal estimate of

$$\alpha = \frac{1}{2\tau\beta} \log\left( \frac{\sum_{k:\, \beta_k < 0} \mu_k e^{\tau\beta\gamma_k}}{\sum_{k:\, \beta_k > 0} \mu_k e^{-\tau\beta\gamma_k}} \right), \qquad (17)$$

where $\beta = |\beta_k|$ is constant (§S.4.1 and [7]). For a unimodal distribution, this closed-form expression
provides an efficient way of estimating expectations. Thus, in avoiding numerical integration,
Algorithm 1 is performing approximate variational inference.

Figure 5: The effects of the versatile logistic prior on classifier weights ($\mu_0 = w_n = \tau = 1$).

Figure 6: Spam dataset, classification error.

Figure 7: State dataset, classification error.
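The closed-form modal estimate can be compared with an exact mode found by bisection on the derivative of the log-density. The sketch assumes the estimate takes the form obtained by setting the derivative of the single-tail surrogate to zero with all $|\beta_k|$ equal; `approx_mode` and `exact_mode` are names introduced here.

```python
import math

def approx_mode(beta, gamma, mu, tau=1.0):
    """Closed-form modal estimate; assumes every |beta_k| equals the same constant."""
    b = abs(beta[0])
    num = sum(m * math.exp(tau * b * g) for bk, g, m in zip(beta, gamma, mu) if bk < 0)
    den = sum(m * math.exp(-tau * b * g) for bk, g, m in zip(beta, gamma, mu) if bk > 0)
    return math.log(num / den) / (2.0 * tau * b)

def exact_mode(beta, gamma, mu, lo=-50.0, hi=50.0, iters=200):
    """Bisection on the derivative of the log-density, which decreases by concavity."""
    def dlog(z):
        return -sum(m * bk / (1.0 + math.exp(-bk * (z - g)))
                    for bk, g, m in zip(beta, gamma, mu))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dlog(mid) > 0.0:
            lo = mid   # density still increasing: the mode lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: v-Log(u, 0, [1, 3]) with u = [+1, -1].
beta, gamma, mu = [+1.0, -1.0], [0.0, 0.0], [1.0, 3.0]
z_exact = exact_mode(beta, gamma, mu)
z_approx = approx_mode(beta, gamma, mu, tau=1.0)
print(z_exact, z_approx)
```

For this two-term example the exact mode is $\log 3$ while the estimate with $\tau = 1$ is $\frac{1}{2}\log 3$: the sign and ordering are preserved, but the magnitude is smaller, so the estimate should be read as an approximation rather than the true mode.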

6 Relation to AdaBoost 

We now compare our algorithm to AdaBoost. Consider the simpler model of §2: a true-label dataset
with prior assignments (Figure 1). This leaves the greedy step of finding the maximal mode in Algorithm 1
and updating $H$ without the variational inference. We now investigate the approximation
supplied by (17) with $\mu = \tau = 1$. Let $Z = \sum_{n=1}^{N} e^{-y_n H(x_n)}$ and $d_n = e^{-y_n H(x_n)}/Z$ so that
$\sum_{n=1}^{N} d_n = 1$. The approximate mode $\alpha$ is

$$\alpha = \frac{1}{2} \log\left( \frac{1 - \epsilon + 1/Z}{\epsilon + 1/Z} \right), \qquad (18)$$

where $\epsilon = \sum_n d_n \mathbb{1}\{h(x_n) \neq y_n\}$ is a weighted error ascribed to the new classifier (§S.6). When
compared to AdaBoost, the update is identical when the $1/Z$ term is not present. Effectively, the
$1/Z$ term results in a shrinkage of the assigned weights (see Figure 5).
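The shrinkage can be illustrated on a toy example. The sketch below builds the slope and knot vectors of Algorithm 1 with unit multiplicities, evaluates the closed-form modal estimate of (17), and compares it with the expression $\frac{1}{2}\log\frac{1 - \epsilon + 1/Z}{\epsilon + 1/Z}$ and with the usual AdaBoost weight $\frac{1}{2}\log\frac{1 - \epsilon}{\epsilon}$. The data values and function names are assumptions made for illustration.

```python
import math

def approx_mode(beta, gamma, mu, tau=1.0):
    """Closed-form modal estimate; assumes all |beta_k| share one value."""
    b = abs(beta[0])
    num = sum(m * math.exp(tau * b * g) for bk, g, m in zip(beta, gamma, mu) if bk < 0)
    den = sum(m * math.exp(-tau * b * g) for bk, g, m in zip(beta, gamma, mu) if bk > 0)
    return math.log(num / den) / (2.0 * tau * b)

def weighted_error(ys, H_vals, h_vals):
    """Return (epsilon, Z), where d_n = exp(-y_n H(x_n)) / Z."""
    losses = [math.exp(-y * H) for y, H in zip(ys, H_vals)]
    Z = sum(losses)
    eps = sum(l for l, y, h in zip(losses, ys, h_vals) if h != y) / Z
    return eps, Z

def viboost_weight(eps, Z):
    """Smoothed weight with the 1/Z terms."""
    return 0.5 * math.log((1.0 - eps + 1.0 / Z) / (eps + 1.0 / Z))

def adaboost_weight(eps):
    """Standard AdaBoost weight."""
    return 0.5 * math.log((1.0 - eps) / eps)

ys = [+1, +1, -1, -1]
H_vals = [0.5, -0.2, -1.0, 0.3]   # assumed current values H(x_n)
h_vals = [+1, -1, -1, -1]         # candidate classifier outputs h(x_n)

eps, Z = weighted_error(ys, H_vals, h_vals)
beta = [+1.0, -1.0] + [-y * h for y, h in zip(ys, h_vals)]
gamma = [0.0, 0.0] + [-H * h for H, h in zip(H_vals, h_vals)]
mu = [1.0] * len(beta)

alpha_mode = approx_mode(beta, gamma, mu)
alpha_smooth = viboost_weight(eps, Z)
alpha_ada = adaboost_weight(eps)
print(alpha_mode, alpha_smooth, alpha_ada)
```

On this example the modal estimate and the smoothed expression agree, and both are smaller in magnitude than the AdaBoost weight, which is the shrinkage effect described in the text.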

The variable Z is equal to the current exponential loss. If Z is small then 1/Z is large which leads 
to a dampened weight assignment (and vice versa). Assuming the exponential loss decreases with 
more iterations, the algorithm acts like AdaBoost early on and then becomes more conservative 
with each iteration. Quinlan [19] incorporated similar smoothing for AdaBoost and described it as
"necessarily ad-hoc". In the proposed model, this smoothing results from the prior assignment. We
also note that AdaBoost selects the base classifier that minimizes $\epsilon$. From (18), this coincides with
the largest approximation-based mode.

The slopes $\beta(h)$ as defined in Algorithm 1 contain individual [mis]matches of the base classifier
with the labels, whereas the knots contain individual, weighted [mis]matches of the base classifier
with the current log-odds-ratio estimate. The prior effectively augments the data by inserting
two phantom examples. Each example lies in the zero level set of $H$ as indicated by a knot of
0 ($H(x)h(x) = 0 \Rightarrow H(x) = 0$). The slopes of $\pm 1$ presume that the base classifier succeeds in
correctly labeling one of the pseudo-examples, while failing with the other.

Finally, leveraging $d_{1:N}$ we can rewrite Algorithm 1 to use the weighted error $\epsilon$ rather than a mode
search. Using these errors for ranking the base classifiers, as done in AdaBoost, decreases computation
time significantly when searching for a new candidate base classifier. The greedy search in
VIBoost would then have a runtime similar to that of AdaBoost.



7 Experiments 

We studied VIBoost on real and synthetic data. We found that VIBoost works as well as AdaBoost
for binary classification. More importantly, we show that the variables accounting for label noise
are a meaningful diagnostic of misfit. For all experiments, our VIBoost initialization was
$\mu_0 = \mu'_0 = \phi_n = \zeta_i = \eta_i = \tau = 1$. Setting $\tau = 1$ provides the closest means of comparison with AdaBoost. For all
experiments we used decision stumps as our base classifiers. Using the variational parameters of the
type prior, we use $\eta_1/\eta_2$ for the SNR. For the noise grade we use $\log(\omega_2/\omega_1)$, the mode associated
with the approximate posterior. All results presented are average values calculated over 40 runs.

Figure 8: Step dataset, signal-to-noise ratio.

Figure 9: Step dataset, noise grade.

Figure 10: Long-Servedio dataset, signal-to-noise ratio.

As VIBoost outputs a classifier, we investigate classifier quality on two real-world text datasets.
The first dataset is the 57-feature spam dataset [4]. It has 6,401 examples; each run trained on a
random 10% and tested on the remaining 90%. The second dataset is a state dataset [8] comprising
145 documents with 22,648 features (bag of words). Instead of the word count, however, we used a
present/absent binary value. Each document relates to Illinois or Michigan. We trained on 30% and
tested on the remaining 70% (random splits). Error results are featured in Figures 6 and 7 and reveal
that VIBoost and AdaBoost performed similarly.

In addition to a classifier, VIBoost also provides noise statistics. Using a synthetic dataset,
we now look at the algorithm's estimate of the posterior SNR ($\eta_1/\eta_2$) and the posterior noise
grade ($\log(\omega_2/\omega_1)$) after 50 iterations. We simulated 100 examples on the real line with
$\mathcal{X} = \{-99, -97, -95, \ldots, +99\}$. Following the generative process of §4 we constructed the step dataset
with $F(x) = +\infty$ for $x$ positive and $-\infty$ for $x$ negative (+1 label for $x$ positive and -1 label for $x$
negative). With a noise grade of $\log 3 \approx 1.1$, we varied the type prior, $\theta$, of the generative process.
Figures 8 and 9 respectively show the SNR and noise grade with varying $\theta$. In a pure-noise situation
($\theta = 0$) the SNR is at its lowest and the noise grade is best estimated. Conversely, in the absence of
noise ($\theta = 1$) the SNR is at its greatest, rendering the noise grade estimate irrelevant.
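The construction of the step dataset can be sketched as below. This is an illustration rather than the authors' simulation code: the function name, the returned tuple layout, and the use of a fixed `theta` (instead of drawing it from a Beta prior) are assumptions.

```python
import math
import random

def make_step_dataset(theta, xi, rng):
    """100 points on the odd integers -99..+99; each label is true w.p. theta, else noisy."""
    xs = list(range(-99, 100, 2))
    p_noise_plus = 1.0 / (1.0 + math.exp(-xi))   # P(y = +1) for a noisy label
    data = []
    for x in xs:
        is_true = rng.random() < theta
        if is_true:
            y = 1 if x > 0 else -1   # F(x) = +/- infinity makes true labels deterministic
        else:
            y = 1 if rng.random() < p_noise_plus else -1
        data.append((x, y, is_true))
    return data

rng = random.Random(0)
clean = make_step_dataset(theta=1.0, xi=math.log(3.0), rng=rng)
noisy = make_step_dataset(theta=0.0, xi=math.log(3.0), rng=rng)
frac_plus = sum(1 for _, y, _ in noisy if y == 1) / len(noisy)
print(len(clean), frac_plus)
```

With `theta = 0` every label is a noisy draw, so the fraction of +1 labels fluctuates around 3/4, the probability implied by a noise grade of $\log 3$.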

The last dataset we considered in this paper was also simulated. Inspired by [16], we constructed a
1,200-example, 31-feature Long-Servedio dataset, a dataset that provably "breaks" AdaBoost and
many other algorithms based on convex loss minimization. The details can be found in [20, §12.3]
and Matlab code is included in §S.7. Each run comprised 200 training examples and 1,000 testing
examples (random splits). Following [20], the noise level was set to 0.20. In our context, this
translates to a type prior of 0 (always reassigning a random label) and a noise grade of $\approx -1.4$.
Not surprisingly, AdaBoost and VIBoost do not succeed in finding a decent classifier for this set.
However, the SNR values produced by VIBoost indicate that poor classification should be expected
(Figure 10). As a result, the algorithm shifts its focus from classification to noise quantification.

As variational inference navigates through a vast set of auxiliary distributions, our only verifiable
means of efficacy is provided by the ELBO (§S.5). Empirically, we have noticed that ELBO increases
are larger in the beginning main-loop iterations. As the algorithm progresses, changes in the
ELBO are quite small and sometimes negative (and small). The small changes are expected because
the composite classifier's accuracy is improving. We hypothesize that the negative changes stem
from the modal approximation used in a $w_n$-update expectation. Alternatively, the ELBO requires a
v-Log normalization constant, which we compute numerically and may be inexact.



8 Conclusion 



We have developed a new boosting-like algorithm. VIBoost attempts to fit a posterior distribution 
by applying variational inference to a dynamic model. We began with a model centered around the 
binary classification problem and augmented it hierarchically to account for noise. 

We did not set out to improve AdaBoost. In addition to forming a binary classifier, the Bayesian
model facilitated a label noise extension and we were able to extract information beyond classification.
We have observed experimentally that a good classifier is accompanied by a large SNR. The
SNR may explain why a poor classifier is returned by the learning algorithm. We demonstrated this
by analyzing the Long-Servedio dataset.

This model and accompanying algorithm are fertile ground for future work. This paper did not address
multi-class problems or regression. We can also extend our model by forming connections
between instances, base classifiers, and classifier weights (currently, these three features are conditionally
independent given the labels). We can also form dependencies between instances and label
types, modeling varying levels of noise throughout the instance space.

S.1 Properties of the Versatile Logistic

The set of nonnegative reals is denoted $\mathbb{R}_+$. For vectors $\beta, \gamma \in \mathbb{R}^K$ and $\mu \in \mathbb{R}^K_+$ let

$$f(z) = \prod_{k=1}^{K} \left( \frac{1}{1 + \exp[\beta_k(z - \gamma_k)]} \right)^{\mu_k} = \prod_{k=1}^{K} r_k(z), \qquad (19)$$

where $r_k(z) = \left( \frac{1}{1 + \exp[\beta_k(z - \gamma_k)]} \right)^{\mu_k}$. We note that

$$0 < \frac{1}{1 + \exp[\beta_k(z - \gamma_k)]} < 1 \qquad (20)$$

$$\Rightarrow \quad 0 < \left( \frac{1}{1 + \exp[\beta_k(z - \gamma_k)]} \right)^{\mu_k} = r_k(z) \leq 1 \qquad (21)$$

and

$$r_k(z) = \left( \frac{1}{1 + \exp[\beta_k(z - \gamma_k)]} \right)^{\mu_k} \leq \left( \frac{1}{\exp[\beta_k(z - \gamma_k)]} \right)^{\mu_k} = \exp[-\mu_k\beta_k(z - \gamma_k)]. \qquad (22)$$

Lemma S.1.1. The integral $I = \int_{-\infty}^{+\infty} f(z)\,dz$ is finite if and only if there exist an $i$ and $j$ such that
$\beta_i > 0$, $\mu_i > 0$, $\beta_j < 0$, and $\mu_j > 0$.

Proof. If there is a $k$ such that $\beta_k = 0$ or $\mu_k = 0$ then $r_k(z)$ is a constant and does not contribute to
the finiteness of the integral. Therefore, without loss of generality we can assume that none of the
$\beta_k$ or $\mu_k$ are zero.

($\Leftarrow$) With $r_k(z)$ nonnegative and bounded by 1 we have $f(z) \leq r_k(z)$ for all $k$ and $z$. It follows that

$$I \leq \int_{-\infty}^{+\infty} \min_k r_k(z)\,dz \qquad (23)$$

$$\leq \int_{-\infty}^{+\infty} \min\{r_i(z), r_j(z)\}\,dz \qquad (24)$$

$$= \int_{-\infty}^{t} r_j(z)\,dz + \int_{t}^{+\infty} r_i(z)\,dz, \qquad (25)$$

where $t$ is the unique solution to $r_i(t) = r_j(t)$. To show that $t$ exists, let $g(z) = r_j(z) - r_i(z)$. We
have $\lim_{z \to -\infty} g(z) = 0 - 1 = -1$ and $\lim_{z \to +\infty} g(z) = 1 - 0 = 1$. Also,

$$g'(z) = -\mu_j\beta_j (1 + \exp[\beta_j(z - \gamma_j)])^{-\mu_j - 1} \exp[\beta_j(z - \gamma_j)] + \mu_i\beta_i (1 + \exp[\beta_i(z - \gamma_i)])^{-\mu_i - 1} \exp[\beta_i(z - \gamma_i)] > 0. \qquad (26)$$

With $g$ increasing it crosses the $z$-axis once (intermediate value theorem) at $t$. From (22) we have

$$I \leq \int_{-\infty}^{t} \exp[-\mu_j\beta_j(z - \gamma_j)]\,dz + \int_{t}^{+\infty} \exp[-\mu_i\beta_i(z - \gamma_i)]\,dz, \qquad (27)$$

which is finite because each integral on the right-hand side is an integral of an exponential tail.

($\Rightarrow$) Assume all of the $\beta_k$ are negative. As $z \to +\infty$, $f(z)$ will approach 1. So there exists a $z_1 \in \mathbb{R}$
such that $f(z) > 1/2$ when $z > z_1$, leading to a divergent integral. The analogous case can be made
for the $\beta_k$ all positive. The only other choice is that there is a $\beta_i > 0$ and $\beta_j < 0$. □

Corollary S.1.2. The density $p(z) \propto f(z)$ is valid if and only if there exist an $i$ and $j$ such that
$\beta_i > 0$, $\mu_i > 0$, $\beta_j < 0$, and $\mu_j > 0$.

Lemma S.1.3. $\log r_k(z)$ is concave.

Proof. If $\beta_k = 0$ or $\mu_k = 0$ then $r_k(z)$ is a constant function, which is concave. Otherwise, we have

$$\frac{d}{dz} \log r_k(z) = -\frac{\mu_k\beta_k \exp[\beta_k(z - \gamma_k)]}{1 + \exp[\beta_k(z - \gamma_k)]} = -\frac{\mu_k\beta_k}{1 + \exp[-\beta_k(z - \gamma_k)]} \qquad (28)$$

$$\frac{d^2}{dz^2} \log r_k(z) = -\frac{\mu_k\beta_k^2 \exp[-\beta_k(z - \gamma_k)]}{(1 + \exp[-\beta_k(z - \gamma_k)])^2} < 0, \qquad (29)$$

proving concavity. □

Lemma S.1.4. The distribution v-Log($\beta, \gamma, \mu$) is unimodal.

Proof. If $p(z)$ is the associated density then we wish to show that $p(z)$ has one critical point. Since
$p(z) > 0$, $\log p(z)$ will have the same critical points as $p(z)$ because $d(\log p(z))/dz = p'(z)/p(z)$.
From the previous Lemma, $\log p(z)$ is a sum of concave terms $\log r_k(z)$ plus a constant and is therefore concave.
Being a valid density over the reals, concavity ensures that $\log p(z)$ will have one critical point as it
increases and then decreases. □

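S.1.1 On v-Log($\beta u, \gamma\mathbf{1}, [\mu_1, \mu_2]^T$)

Recall that $u = [+1, -1]^T$ and we will assume $\beta > 0$. Let $V \sim$ Beta($\mu_1, \mu_2$) and
$g(v) = \gamma + \frac{1}{\beta}\log\left(\frac{1}{v} - 1\right)$. The function $g$ is monotonic and maps $[0, 1]$ to $\mathbb{R}$. The inverse function is
$g^{-1}(z) = \frac{1}{1 + \exp[\beta(z - \gamma)]}$. Note that $1 - g^{-1}(z) = \frac{1}{1 + \exp[-\beta(z - \gamma)]}$. If $Z = g(V)$ then, writing
$v = g^{-1}(z)$, we have

$$p_Z(z) = \frac{1}{|g'(v)|} \, \frac{\Gamma(\mu_1 + \mu_2)}{\Gamma(\mu_1)\Gamma(\mu_2)} \, v^{\mu_1 - 1}(1 - v)^{\mu_2 - 1} \qquad (30)$$

$$= \beta v(1 - v) \, \frac{\Gamma(\mu_1 + \mu_2)}{\Gamma(\mu_1)\Gamma(\mu_2)} \, v^{\mu_1 - 1}(1 - v)^{\mu_2 - 1} \qquad (31)$$

$$= \frac{\beta\,\Gamma(\mu_1 + \mu_2)}{\Gamma(\mu_1)\Gamma(\mu_2)} \left( \frac{1}{1 + \exp[\beta(z - \gamma)]} \right)^{\mu_1} \left( \frac{1}{1 + \exp[-\beta(z - \gamma)]} \right)^{\mu_2}. \qquad (32)$$

Thus, $Z \sim$ v-Log($\beta u, \gamma\mathbf{1}, [\mu_1, \mu_2]^T$). The normalization constant is

$$\frac{1}{\beta} \, \frac{\Gamma(\mu_1)\Gamma(\mu_2)}{\Gamma(\mu_1 + \mu_2)}.$$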
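The change of variables in this subsection suggests a direct way to sample a two-term versatile logistic: draw a Beta variate and push it through $g$. The sketch below is an illustration with assumed parameter values and hypothetical function names; it also checks numerically that the stated density integrates to one.

```python
import math
import random

def g(v, beta, gamma):
    """Map v in (0, 1) to the real line: gamma + (1/beta) * log(1/v - 1)."""
    return gamma + math.log(1.0 / v - 1.0) / beta

def g_inv(z, beta, gamma):
    """Inverse map: 1 / (1 + exp(beta * (z - gamma)))."""
    return 1.0 / (1.0 + math.exp(beta * (z - gamma)))

def sample_vlog2(beta, gamma, mu1, mu2, rng):
    """Sample the two-term v-Log by transforming a Beta(mu1, mu2) draw."""
    return g(rng.betavariate(mu1, mu2), beta, gamma)

def vlog2_pdf(z, beta, gamma, mu1, mu2):
    """Normalized density of the two-term v-Log."""
    const = beta * math.gamma(mu1 + mu2) / (math.gamma(mu1) * math.gamma(mu2))
    v = g_inv(z, beta, gamma)
    return const * v ** mu1 * (1.0 - v) ** mu2

def trapezoid(f, lo, hi, n):
    """Plain trapezoid rule with n equal sub-intervals."""
    h = (hi - lo) / n
    return h * (0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, n)))

beta, gamma, mu1, mu2 = 1.5, 0.5, 2.0, 3.0   # assumed illustrative values
area = trapezoid(lambda z: vlog2_pdf(z, beta, gamma, mu1, mu2), -40.0, 40.0, 8000)
rng = random.Random(0)
draws = [sample_vlog2(beta, gamma, mu1, mu2, rng) for _ in range(5)]
print(area, draws)
```

This transform only covers the two-term case with a shared knot and opposite slopes; the general v-Log with many terms has no such shortcut, which is why the paper resorts to other techniques for it.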

S.1.2 The Exponential Family

A density of the form

$$p(z \mid \eta) = h(z)\exp\{\eta^T t(z) - a(\eta)\} \qquad (33)$$

is said to belong to the exponential family. If we set $h(z) = 1$, define the sufficient statistics

$$t(z) = \left[ \log(1 + e^{\beta_1(z - \gamma_1)}), \ldots, \log(1 + e^{\beta_K(z - \gamma_K)}) \right]^T \qquad (34)$$

and set the natural parameters $\eta = -\mu \in \mathbb{R}^K$, then v-Log($\beta, \gamma, \mu$) is a member of the exponential
family. Observe:

$$h(z)\exp\{\eta^T t(z) - a(\eta)\} = \exp\left\{ -\sum_{k=1}^{K} \mu_k \log(1 + e^{\beta_k(z - \gamma_k)}) - a(-\mu) \right\} \qquad (35)$$

$$= e^{-a(-\mu)} \prod_{k=1}^{K} \left( \frac{1}{1 + e^{\beta_k(z - \gamma_k)}} \right)^{\mu_k} \qquad (36)$$

$$\propto \prod_{k=1}^{K} \left( \frac{1}{1 + e^{\beta_k(z - \gamma_k)}} \right)^{\mu_k}. \qquad (37)$$

S.2 Accounting for label inversions

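Let $\rho = [\rho_1, \rho_2, \rho_3]^T$ denote a probability vector and let $v \in \{-1, +1\}$ denote a label. We now
form $y$, a stochastic mapping of $v$, as follows:

1. With probability $\rho_1$, $y \leftarrow v$ [equality]

2. With probability $\rho_2$, $y \leftarrow -v$ [inversion]

3. With probability $\rho_3$, $y \leftarrow \begin{cases} +1 & \text{w/ prob } r \\ -1 & \text{w/ prob } 1-r \end{cases}$ [independent Bernoulli trial]

As outlined above, the formation of $y$ from $v$ possesses 3 degrees of freedom: $r$ and two elements
of $\rho$. We can create the equivalent stochastic mapping:

1. With probability $\theta = 2\rho_1 + \rho_3 - 1$, $y \leftarrow v$ [equality]

2. With probability $\bar\theta = 1 - \theta = 2 - 2\rho_1 - \rho_3$, $y \leftarrow \begin{cases} +1 & \text{w/ prob } s = \dfrac{1-\rho_1-\rho_3(1-r)}{2-2\rho_1-\rho_3} \\ -1 & \text{w/ prob } 1-s \end{cases}$ [independent Bernoulli trial]

To show equivalence, we have:

\begin{align}
P(Y = +1 \mid V = +1) &= \theta + \bar\theta s \quad [= 1 - a] \tag{38}\\
&= (2\rho_1 + \rho_3 - 1) + (2 - 2\rho_1 - \rho_3)\,\frac{1-\rho_1-\rho_3(1-r)}{2-2\rho_1-\rho_3} \tag{39}\\
&= \rho_1 + \rho_3 r \tag{40}\\
P(Y = -1 \mid V = +1) &= \bar\theta(1 - s) \quad [= a] \tag{41}\\
&= (2 - 2\rho_1 - \rho_3)\left(1 - \frac{1-\rho_1-\rho_3(1-r)}{2-2\rho_1-\rho_3}\right) \tag{42}\\
&= 1 - \rho_1 - \rho_3 + \rho_3(1-r) \tag{43}\\
&= \rho_2 + \rho_3(1-r) \tag{44}\\
P(Y = +1 \mid V = -1) &= \bar\theta s \quad [= b] \tag{45}\\
&= (2 - 2\rho_1 - \rho_3)\,\frac{1-\rho_1-\rho_3(1-r)}{2-2\rho_1-\rho_3} \tag{46}\\
&= 1 - \rho_1 - \rho_3 + \rho_3 r \tag{47}\\
&= \rho_2 + \rho_3 r \tag{48}\\
P(Y = -1 \mid V = -1) &= \theta + \bar\theta(1 - s) \quad [= 1 - b] \tag{49}\\
&= (2\rho_1 + \rho_3 - 1) + (2 - 2\rho_1 - \rho_3)\left(1 - \frac{1-\rho_1-\rho_3(1-r)}{2-2\rho_1-\rho_3}\right) \tag{50}\\
&= (2\rho_1 + \rho_3 - 1) + \rho_2 + \rho_3(1-r) \tag{51}\\
&= \rho_1 + \rho_3(1-r). \tag{52}
\end{align}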
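The equivalence of the two mappings can also be checked numerically. The following Python sketch is an illustration under stated assumptions: the function names `channel_from_rho` and `channel_from_theta` are hypothetical, and the inputs are assumed to satisfy $2\rho_1 + \rho_3 \geq 1$ so that $\theta$ is a valid probability and $\bar\theta > 0$.

```python
def channel_from_rho(rho, r):
    """Transition probabilities of the three-way mapping (equality, inversion, coin flip).

    Returns (P(+1|+1), P(-1|+1), P(+1|-1), P(-1|-1)).
    """
    rho1, rho2, rho3 = rho
    return (rho1 + rho3 * r,          # kept (+1) or coin lands +1
            rho2 + rho3 * (1 - r),    # inverted or coin lands -1
            rho2 + rho3 * r,          # inverted or coin lands +1
            rho1 + rho3 * (1 - r))    # kept (-1) or coin lands -1


def channel_from_theta(rho, r):
    """Same probabilities via the two-way mapping with parameters theta and s."""
    rho1, _, rho3 = rho
    theta = 2 * rho1 + rho3 - 1
    theta_bar = 1 - theta
    s = (1 - rho1 - rho3 * (1 - r)) / theta_bar
    return (theta + theta_bar * s,
            theta_bar * (1 - s),
            theta_bar * s,
            theta + theta_bar * (1 - s))
```

With `rho = (0.7, 0.1, 0.2)` and `r = 0.3`, both functions return the same four probabilities up to floating-point rounding, as (38)-(52) predict.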

The above also describes a Binary Asymmetric Channel (BAC) [18] with parameters $a$ and $b$ (see
Figure 11). When we expect a balanced dataset, i.e., the expected numbers of $+1$ and $-1$ labels are
equal, the ratio of true labels to noisy labels is

\[ \frac{\frac{N}{2}(1-a) + \frac{N}{2}(1-b)}{\frac{N}{2}a + \frac{N}{2}b} = \frac{2-(a+b)}{a+b} = \frac{2-\bar\theta}{\bar\theta} = \frac{1+\theta}{1-\theta}, \tag{53} \]

which is lower bounded by 1. Looking ahead, $\theta$ represents a random quantity with expectation
$\frac{\eta_1}{\eta_1+\eta_2}$. Using this expectation in place of $\theta$, the above ratio becomes

\[ \frac{1+\frac{\eta_1}{\eta_1+\eta_2}}{1-\frac{\eta_1}{\eta_1+\eta_2}} = \frac{2\eta_1+\eta_2}{\eta_2} = 1 + 2\,\frac{\eta_1}{\eta_2} = 1 + 2\,\frac{E[\theta]}{E[1-\theta]}, \tag{54} \]

thus motivating the use of $\frac{E[\theta]}{E[1-\theta]}$.

Figure 11: The Binary Asymmetric Channel, $P(Y \mid V)$.



S.3 The Gibbs Sampler 



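This section refers to the original noise model. The joint, denoted $J$, is

\begin{align}
J \propto{} & \left(\frac{1}{1+e^{\xi}}\right)^{\mu'_0}\left(\frac{1}{1+e^{-\xi}}\right)^{\mu'_0} \times \theta^{\zeta_1-1}(1-\theta)^{\zeta_2-1} \times \prod_{m=1}^{M}\left(\frac{1}{1+e^{c_m}}\right)^{\mu_0}\left(\frac{1}{1+e^{-c_m}}\right)^{\mu_0} \nonumber\\
& \times \prod_{n=1}^{N}\theta^{w_n}(1-\theta)^{1-w_n}\left(\frac{1}{1+\exp\left[-y_n\sum_{m=1}^{M}c_m f_m(x_n)\right]}\right)^{w_n}\left(\frac{1}{1+\exp[-y_n\xi]}\right)^{1-w_n} \tag{55}
\end{align}

For variable $z$, let $J[z]$ denote the distribution of $z$ with all other variables fixed. Starting with $c_i$, and using $f_i(x_n) \in \{-1, +1\}$, we
have

\begin{align}
J[c_i] &\propto \left(\frac{1}{1+e^{c_i}}\right)^{\mu_0}\left(\frac{1}{1+e^{-c_i}}\right)^{\mu_0}\prod_{n=1}^{N}\left(\frac{1}{1+\exp\left[-y_n\sum_{m=1}^{M}c_m f_m(x_n)\right]}\right)^{w_n} \tag{56}\\
&= \left(\frac{1}{1+e^{c_i}}\right)^{\mu_0}\left(\frac{1}{1+e^{-c_i}}\right)^{\mu_0}\prod_{n=1}^{N}\left(\frac{1}{1+\exp\left[-y_n f_i(x_n)\left(c_i + f_i(x_n)h_i(x_n)\right)\right]}\right)^{w_n} \tag{57}\\
&= \text{v-Log}\left(\begin{bmatrix} +1 \\ -1 \\ -y_1 f_i(x_1) \\ \vdots \\ -y_N f_i(x_N) \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ -f_i(x_1)h_i(x_1) \\ \vdots \\ -f_i(x_N)h_i(x_N) \end{bmatrix}, \begin{bmatrix} \mu_0 \\ \mu_0 \\ w_1 \\ \vdots \\ w_N \end{bmatrix}\right) \tag{58}
\end{align}

where $h_i(x) = \sum_{m \neq i} c_m f_m(x)$. Next we consider $\xi$:

\begin{align}
J[\xi] &\propto \left(\frac{1}{1+e^{\xi}}\right)^{\mu'_0}\left(\frac{1}{1+e^{-\xi}}\right)^{\mu'_0}\prod_{n=1}^{N}\left(\frac{1}{1+\exp[-y_n\xi]}\right)^{1-w_n} \tag{59}\\
&= \left(\frac{1}{1+e^{\xi}}\right)^{\mu'_0+\sum_{n:y_n=-1}(1-w_n)}\left(\frac{1}{1+e^{-\xi}}\right)^{\mu'_0+\sum_{n:y_n=+1}(1-w_n)} \tag{60}\\
&= \text{v-Log}\left(\begin{bmatrix} +1 \\ -1 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \mu'_0+\sum_{n:y_n=-1}(1-w_n) \\ \mu'_0+\sum_{n:y_n=+1}(1-w_n) \end{bmatrix}\right) \tag{61}
\end{align}

For $w_i$, we have

\begin{align}
J[w_i] &\propto \theta^{w_i}(1-\theta)^{1-w_i}\left(\frac{1}{1+\exp\left[-y_i\sum_{m=1}^{M}c_m f_m(x_i)\right]}\right)^{w_i}\left(\frac{1}{1+\exp[-y_i\xi]}\right)^{1-w_i} \tag{62}\\
&= \text{Bernoulli}\left(\frac{\dfrac{\theta}{1+\exp\left[-y_i\sum_{m=1}^{M}c_m f_m(x_i)\right]}}{\dfrac{\theta}{1+\exp\left[-y_i\sum_{m=1}^{M}c_m f_m(x_i)\right]} + \dfrac{1-\theta}{1+\exp[-y_i\xi]}}\right) \tag{63}
\end{align}

Finally, for $\theta$ we have

\begin{align}
J[\theta] &\propto \theta^{\zeta_1-1}(1-\theta)^{\zeta_2-1}\prod_{n=1}^{N}\theta^{w_n}(1-\theta)^{1-w_n} \tag{64}\\
&= \text{Beta}\left(\zeta_1 + \sum_{n=1}^{N}w_n,\; \zeta_2 + \sum_{n=1}^{N}(1-w_n)\right) \tag{65}
\end{align}

The Gibbs sampler is given in Algorithm 2.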



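Algorithm 2 Gibbs Sampler

Input: $\{(x_n, y_n)\}_{n=1}^{N}$, $\mu_0 \in \mathbb{R}_+$, $\mu'_0 \in \mathbb{R}_+$, $\zeta \in \mathbb{R}_+^2$
Initialize $c \in \mathbb{R}_+^M$, $\xi \in \mathbb{R}_+$, $\theta \in [0, 1]$, and $w \in \{0, 1\}^N$
for $t = 1$ to $T$ do
    for $i = 1$ to $M$ do
        $c_i \sim$ v-Log$\left([+1, -1, -y_1 f_i(x_1), \dots, -y_N f_i(x_N)]^T,\; [0, 0, -f_i(x_1)h_i(x_1), \dots, -f_i(x_N)h_i(x_N)]^T,\; [\mu_0, \mu_0, w_1, \dots, w_N]^T\right)$
    end for
    $\xi \sim$ v-Log$\left([+1, -1]^T,\; [0, 0]^T,\; \left[\mu'_0 + \sum_{n:y_n=-1}(1-w_n),\; \mu'_0 + \sum_{n:y_n=+1}(1-w_n)\right]^T\right)$
    for $i = 1$ to $N$ do
        $w_i \sim$ Bernoulli$\left(\dfrac{\theta\,/\,(1+\exp[-y_i\sum_{m=1}^{M}c_m f_m(x_i)])}{\theta\,/\,(1+\exp[-y_i\sum_{m=1}^{M}c_m f_m(x_i)]) + (1-\theta)\,/\,(1+\exp[-y_i\xi])}\right)$
    end for
    $\theta \sim$ Beta$\left(\zeta_1 + \sum_{n=1}^{N}w_n,\; \zeta_2 + \sum_{n=1}^{N}(1-w_n)\right)$
end for
Output: Samples from the posterior

S.4 The Variational Updates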



S.4.1 The weight update (c) 

Isolating the terms of the log-joint ($\mathcal{L}$) involving $c$, we obtain

\[ \mathcal{L}_1 = \text{const} - \mu_0\log(1+e^{c}) - \mu_0\log(1+e^{-c}) - \sum_{n=1}^{N}w_n\log\left(1+\exp[-y_n(H(x_n)+c\,h(x_n))]\right) \tag{66} \]

We only require the expectation with respect to $w_{1:N}$:

\[ \log q^*(c \mid \beta, \gamma, \mu) = \text{const} - \mu_0\log(1+e^{c}) - \mu_0\log(1+e^{-c}) - \sum_{n=1}^{N}\phi_n\log\left(1+\exp[-y_n(H(x_n)+c\,h(x_n))]\right) \tag{67} \]

Here we note that $-y_n\{H(x_n)+c\,h(x_n)\} = -y_n h(x_n)\{c+H(x_n)h(x_n)\}$ from the fact that
$h(x_n) \in \{-1, +1\}$. This manipulation expresses each term of the sum as a
b-Log($c$, $-y_n h(x_n)$, $-H(x_n)h(x_n)$) factor, and so conjugacy will come into play. The form presented
in (67) parametrizes a versatile logistic distribution with parameters of length $N + 2$ given by

\[ \beta(h) = \begin{bmatrix} +1 \\ -1 \\ -y_1 h(x_1) \\ \vdots \\ -y_N h(x_N) \end{bmatrix}, \qquad \gamma(H, h) = \begin{bmatrix} 0 \\ 0 \\ -H(x_1)h(x_1) \\ \vdots \\ -H(x_N)h(x_N) \end{bmatrix}, \qquad \mu(\phi) = \begin{bmatrix} \mu_0 \\ \mu_0 \\ \phi_1 \\ \vdots \\ \phi_N \end{bmatrix} \tag{68} \]



Before proceeding to the next update, we address modal estimation of the versatile logistic. Finding
the mode requires minimizing the negative log of the density, or $\sum_{k=1}^{K}\mu_k\log(1+e^{\beta_k(z-\gamma_k)})$ (the
extraneous normalization constant is discarded). The objective of interest is a weighted LogLoss [3]
and minimizing it can be accomplished iteratively. Alternatively, we can reason that for a fixed $k$,
the quantity $\log(1+e^{\beta_k(z-\gamma_k)})$ contributes most to the mode wherever the exponential term is small.
Using a semi-tail approximation, we have $\log(1+e^{\beta_k(z-\gamma_k)}) \approx e^{\tau\beta_k(z-\gamma_k)}$ for positive scalar $\tau$.
Setting $\tau = 1$ best approximates the extreme part of the tail, whereas $\tau = 1/2$ will match the first
derivative at $z = \gamma_k$. If we restrict ourselves to slopes of equal magnitude, i.e., $|\beta_k| = \beta > 0$, we
now minimize

\[ \sum_{k:\beta_k>0}\mu_k e^{\tau\beta(z-\gamma_k)} + \sum_{k:\beta_k<0}\mu_k e^{-\tau\beta(z-\gamma_k)} \tag{69} \]

Taking the derivative with respect to $z$ and setting equal to zero yields the approximate mode

\[ \alpha = \frac{1}{2\tau\beta}\log\left(\frac{\sum_{k:\beta_k<0}\mu_k e^{+\tau\beta\gamma_k}}{\sum_{k:\beta_k>0}\mu_k e^{-\tau\beta\gamma_k}}\right) \tag{70} \]
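The closed form for the approximate mode is easy to evaluate directly. The sketch below is a hedged Python illustration rather than the authors' implementation: the function name `approx_mode` and its argument layout (parallel lists of slopes, knots and multiplicities) are assumptions, and it presumes at least one positive and one negative slope so that both sums are nonzero. Very large knots may overflow the exponentials; no safeguard is included.

```python
import math


def approx_mode(slopes, knots, mults, tau=0.5):
    """Approximate mode of a versatile logistic whose slopes share one magnitude."""
    beta = abs(slopes[0])
    if any(abs(abs(b) - beta) > 1e-12 for b in slopes):
        raise ValueError("all slopes must have the same magnitude")
    # Numerator: components with negative slope; denominator: positive slope.
    num = sum(m * math.exp(tau * beta * g)
              for b, g, m in zip(slopes, knots, mults) if b < 0)
    den = sum(m * math.exp(-tau * beta * g)
              for b, g, m in zip(slopes, knots, mults) if b > 0)
    return math.log(num / den) / (2.0 * tau * beta)
```

For the two-component case with slopes $[+1, -1]$, zero knots and multiplicities $[\omega_1, \omega_2]$, calling `approx_mode` with `tau=0.5` returns $\log(\omega_2/\omega_1)$.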



S.4.2 The noise grade update ($\xi$)

Starting from the log-likelihood, we have 



\[ \mathcal{L}_2 = \text{const} - \mu'_0\log(1+e^{\xi}) - \mu'_0\log(1+e^{-\xi}) - \sum_{n=1}^{N}(1-w_n)\log(1+\exp[-y_n\xi]) \tag{71} \]

We only require the expectation with respect to $w_{1:N}$:

\[ \log q^*(\xi \mid \omega) = \text{const} - \mu'_0\log(1+e^{\xi}) - \mu'_0\log(1+e^{-\xi}) - \sum_{n=1}^{N}(1-\phi_n)\log(1+\exp[-y_n\xi]). \tag{72} \]

The above parametrizes a versatile logistic with slope vector $u$, knot vector $0$, and respective
multiplicities

\begin{align}
\omega_1 &= \mu'_0 + \sum_{n=1}^{N}(1-\phi_n)\,\mathbb{1}\{y_n = -1\} \tag{73}\\
\omega_2 &= \mu'_0 + \sum_{n=1}^{N}(1-\phi_n)\,\mathbb{1}\{y_n = +1\}. \tag{74}
\end{align}

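S.4.3 The type update ($w_n$)

The relevant terms of the log-likelihood are

\begin{align}
\mathcal{L}_3 ={}& \text{const} + w_n\log\theta + (1-w_n)\log(1-\theta) \nonumber\\
& - w_n\log\left(1+\exp[-y_n(H(x_n)+c\,h(x_n))]\right) - (1-w_n)\log(1+\exp[-y_n\xi]). \tag{75}
\end{align}

Expectation with respect to $c$ is done via the approximate mode $\alpha$, i.e., we replace $c$ with $\alpha$. We
now consider v-Log($u$, $0$, $[\omega_1, \omega_2]$), the approximate posterior of $\xi$. Elementary calculus reveals the
mode is $\log(\omega_2/\omega_1)$, which is in exact agreement with (70) for $\tau = 1/2$. Suppose $V \sim \mathrm{Beta}(\omega_1, \omega_2)$.
If $Z = \log\left(\frac{1}{V} - 1\right)$, then it can be shown that $Z \sim$ v-Log($u$, $0$, $[\omega_1, \omega_2]$) (see S.1.1). In this
particular case, both distributions have the same normalization constant of $\frac{\Gamma(\omega_1)\Gamma(\omega_2)}{\Gamma(\omega_1+\omega_2)}$ and, leveraging
the relationship of $V$ with $Z$, we evaluate

\begin{align}
E_z\{\log(1+e^{+z})\} &= -E_v\{\log V\} = \psi(\omega_0) - \psi(\omega_1) \tag{76}\\
E_z\{\log(1+e^{-z})\} &= -E_v\{\log(1-V)\} = \psi(\omega_0) - \psi(\omega_2), \tag{77}
\end{align}

where $\omega_0 = \omega_1 + \omega_2$ and $\psi$ is the digamma function. (The above two expectations can be used
to derive the differential entropy.)

Taking the expectation of (75) with respect to $\xi$ using (76) and (77), and with respect to $\theta \sim \mathrm{Beta}(\eta_1, \eta_2)$, we obtain

\begin{align}
\log q^*(w_n \mid \phi_n) \approx{}& \text{const} + w_n[\psi(\eta_1) - \psi(\eta_0)] + (1-w_n)[\psi(\eta_2) - \psi(\eta_0)] \nonumber\\
& - w_n\log\left(1+\exp[-y_n(H(x_n)+\alpha h(x_n))]\right) \nonumber\\
& - (1-w_n)[\psi(\omega_0) - \psi(\omega_2)]\,\mathbb{1}\{y_n = +1\} \nonumber\\
& - (1-w_n)[\psi(\omega_0) - \psi(\omega_1)]\,\mathbb{1}\{y_n = -1\}, \tag{78}
\end{align}

where $\eta_0 = \eta_1 + \eta_2$. If

\[ \kappa_n \triangleq \frac{\exp\left[\psi(\eta_1) - \psi(\eta_2) + \psi(\omega_0) - \psi(\omega_2)\,\mathbb{1}\{y_n = +1\} - \psi(\omega_1)\,\mathbb{1}\{y_n = -1\}\right]}{1+\exp[-y_n(H(x_n)+\alpha h(x_n))]} \tag{79} \]

then (78) describes a Bernoulli distribution for $w_n$ with parameter $\phi_n = \kappa_n/(1+\kappa_n)$.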
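To make the type update concrete, the following Python sketch evaluates $\phi_n$ from $\kappa_n$ in log space. It is an illustration under assumptions: the names `digamma`, `log1pexp` and `type_posterior` are hypothetical, and the digamma function is approximated with the standard recurrence $\psi(x) = \psi(x+1) - 1/x$ followed by an asymptotic series, since the Python standard library does not provide it.

```python
import math


def log1pexp(t):
    """Stable evaluation of log(1 + e^t)."""
    return math.log1p(math.exp(-abs(t))) + max(t, 0.0)


def digamma(x):
    """Digamma for x > 0: shift x upward, then apply an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def type_posterior(y, H, h, alpha, eta, omega):
    """phi_n = kappa_n / (1 + kappa_n), with log(kappa_n) taken from (79)."""
    eta1, eta2 = eta
    omega1, omega2 = omega
    omega_y = omega2 if y == 1 else omega1
    log_kappa = (digamma(eta1) - digamma(eta2)
                 + digamma(omega1 + omega2) - digamma(omega_y)
                 - log1pexp(-y * (H + alpha * h)))
    # Logistic function of log_kappa, written to avoid overflow.
    if log_kappa >= 0:
        return 1.0 / (1.0 + math.exp(-log_kappa))
    e = math.exp(log_kappa)
    return e / (1.0 + e)
```

Working with $\log\kappa_n$ rather than $\kappa_n$ avoids forming very large or very small exponentials before the final ratio is taken.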

S.4.4 The type prior update ($\theta$)

Again, starting from the log-likelihood, we have

\[ \mathcal{L}_4 = \text{const} + (\zeta_1 - 1)\log\theta + (\zeta_2 - 1)\log(1-\theta) + \sum_{n=1}^{N}\left[w_n\log\theta + (1-w_n)\log(1-\theta)\right] \tag{80} \]

Taking the expectation with respect to $w_{1:N}$ gives

\[ \log q^*(\theta \mid \eta) = \text{const} + \left(\zeta_1 - 1 + \sum_{n=1}^{N}\phi_n\right)\log\theta + \left(\zeta_2 - 1 + \sum_{n=1}^{N}(1-\phi_n)\right)\log(1-\theta) \tag{81} \]

which corresponds to a Beta distribution with parameters

\begin{align}
\eta_1 &= \zeta_1 + \sum_{n=1}^{N}\phi_n \tag{82}\\
\eta_2 &= \zeta_2 + \sum_{n=1}^{N}(1-\phi_n). \tag{83}
\end{align}

S.5 The [Dynamic] ELBO

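We are interested in the additive change of the ELBO. The log of the joint is

\begin{align}
\mathcal{L} ={}& \text{const} - \mu'_0\log(1+e^{\xi}) - \mu'_0\log(1+e^{-\xi}) - \mu_0\log(1+e^{c}) - \mu_0\log(1+e^{-c}) \nonumber\\
& + (\zeta_1 - 1)\log\theta + (\zeta_2 - 1)\log(1-\theta) + \sum_{n=1}^{N}\left[w_n\log\theta + (1-w_n)\log(1-\theta)\right] \nonumber\\
& - \sum_{n=1}^{N}\left[w_n\log\left(1+\exp[-y_n(H(x_n)+c\,h(x_n))]\right) + (1-w_n)\log(1+\exp[-y_n\xi])\right] \tag{84}
\end{align}

We now take the expectations with respect to the auxiliary distributions:

\begin{align}
E_Q\{\mathcal{L}\} ={}& \text{const} - \sum_{j=1}^{2}\mu'_0\left(\psi(\omega_1+\omega_2) - \psi(\omega_j)\right) - \mu_0 E_c\{\log(1+e^{c})\} - \mu_0 E_c\{\log(1+e^{-c})\} \nonumber\\
& - \sum_{j=1}^{2}(\zeta_j - 1)\left(\psi(\eta_1+\eta_2) - \psi(\eta_j)\right) \nonumber\\
& - \sum_{n=1}^{N}\left[\phi_n\left(\psi(\eta_1+\eta_2) - \psi(\eta_1)\right) + (1-\phi_n)\left(\psi(\eta_1+\eta_2) - \psi(\eta_2)\right)\right] \nonumber\\
& - \sum_{n=1}^{N}(1-\phi_n)\left[\left(\psi(\omega_1+\omega_2) - \psi(\omega_1)\right)\mathbb{1}\{y_n = -1\} + \left(\psi(\omega_1+\omega_2) - \psi(\omega_2)\right)\mathbb{1}\{y_n = +1\}\right] \nonumber\\
& - \sum_{n=1}^{N}\phi_n E_c\left\{\log\left(1+\exp[-y_n(H(x_n)+c\,h(x_n))]\right)\right\} \tag{85}\\
={}& \text{const} - \sum_{j=1}^{2}\mu'_0\left(\psi(\omega_1+\omega_2) - \psi(\omega_j)\right) - \sum_{j=1}^{2}(\zeta_j - 1)\left(\psi(\eta_1+\eta_2) - \psi(\eta_j)\right) \nonumber\\
& - \sum_{n=1}^{N}\left[\phi_n\left(\psi(\eta_1+\eta_2) - \psi(\eta_1)\right) + (1-\phi_n)\left(\psi(\eta_1+\eta_2) - \psi(\eta_2)\right)\right] \nonumber\\
& - \sum_{n=1}^{N}(1-\phi_n)\left[\left(\psi(\omega_1+\omega_2) - \psi(\omega_1)\right)\mathbb{1}\{y_n = -1\} + \left(\psi(\omega_1+\omega_2) - \psi(\omega_2)\right)\mathbb{1}\{y_n = +1\}\right] \nonumber\\
& - \sum_{k=1}^{N+2}\mu_k E_c\left\{\log\left(1+e^{\beta_k(c-\gamma_k)}\right)\right\} \tag{86}
\end{align}

Expectation with respect to $c$ refers to v-Log($\beta$, $\gamma$, $\mu$). Let $B_c$ denote the normalization constant for
this density. We now look at the entropy of the auxiliary distributions.

• $c$:
\[ \log B_c + \sum_{k=1}^{N+2}\mu_k E_c\left\{\log\left(1+e^{\beta_k(c-\gamma_k)}\right)\right\} \tag{87} \]

• $\xi$: From § S.1.1 we have
\[ \log\left(\frac{\Gamma(\omega_1)\Gamma(\omega_2)}{\Gamma(\omega_1+\omega_2)}\right) + (\omega_1+\omega_2)\psi(\omega_1+\omega_2) - \sum_{j=1}^{2}\omega_j\psi(\omega_j) \tag{88} \]

• $w_{1:N}$:
\[ -\sum_{n=1}^{N}\left[\phi_n\log\phi_n + (1-\phi_n)\log(1-\phi_n)\right] \tag{89} \]

• $\theta$:
\[ \log\left(\frac{\Gamma(\eta_1)\Gamma(\eta_2)}{\Gamma(\eta_1+\eta_2)}\right) + (\eta_1+\eta_2-2)\psi(\eta_1+\eta_2) - \sum_{j=1}^{2}(\eta_j - 1)\psi(\eta_j) \tag{90} \]

Combining into a single expression, we obtain

\begin{align}
\text{ELBO} ={}& \text{const} + \log B_c + \sum_{j=1}^{2}\log\Gamma(\omega_j) - \log\Gamma(\omega_0) + (\omega_0 - 2\mu'_0)\psi(\omega_0) - \sum_{j=1}^{2}(\omega_j - \mu'_0)\psi(\omega_j) \nonumber\\
& + \sum_{j=1}^{2}\log\Gamma(\eta_j) - \log\Gamma(\eta_0) + (\eta_0 - \zeta_0)\psi(\eta_0) - \sum_{j=1}^{2}(\eta_j - \zeta_j)\psi(\eta_j) \nonumber\\
& - \sum_{n=1}^{N}\left[\phi_n\log\phi_n + (1-\phi_n)\log(1-\phi_n)\right] - N\psi(\eta_0) + \sum_{n=1}^{N}\left[\psi(\eta_1)\phi_n + \psi(\eta_2)(1-\phi_n)\right] \nonumber\\
& - \psi(\omega_0)\sum_{n=1}^{N}(1-\phi_n) + \psi(\omega_1)\sum_{n:y_n=-1}(1-\phi_n) + \psi(\omega_2)\sum_{n:y_n=+1}(1-\phi_n), \tag{91}
\end{align}

where the 0-subscript denotes the vector sum (except for $\mu_0$).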
S.5.1 Computing the normalization constant 

The change of the ELBO requires the computation of $\log B_c$. Referring back to (19) we are interested
in $B = \int_{-\infty}^{+\infty} f(z)\,dz$. To find $B$ we will employ numerical integration, a reasonable approach
for a function of a single variable. Now, consider

\[ g(z) = -\log f(z) = \sum_{k=1}^{K}\mu_k\log\left(1+e^{\beta_k(z-\gamma_k)}\right), \tag{92} \]

and so $B = \int_{-\infty}^{+\infty} e^{-g(z)}\,dz$. The function $g$ is positive and convex.

One problem with using numerical integration blindly is that for large $z$, $\log(1+e^{z})$ might return
infinity. For example, in evaluating $\log(1+e^{5000})$, computational software will first perform $e^{5000}$
and return a value of infinity. Subsequently, adding one and taking the log will also return infinity.
This motivates the following Lemma.

Lemma S.5.1. For $[z]_+ = z\,\mathbb{1}\{z > 0\}$ we have $\log(1+e^{z}) = \log(1+e^{-|z|}) + [z]_+$.

Proof. If $z \leq 0$ then $z = -|z|$ and $[z]_+ = 0$, yielding equality. If $z > 0$ then $z = |z|$ and $[z]_+ = z$.
We have $\log(1+e^{-z}) + z = \log(1+e^{-z}) + \log e^{z} = \log(1+e^{z})$. □

Utilizing the above Lemma to evaluate $\log(1+e^{z})$ ensures that infinite values are not returned from
software. Revisiting the previous example, $\log(1+e^{5000}) = \log(1+e^{-5000}) + 5000 \approx 0 + 5000 = 5000$.
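Lemma S.5.1 translates into a one-line routine. The Python sketch below is illustrative only; the name `log1pexp` is an assumption. In Python specifically, the naive `math.log(1 + math.exp(5000))` raises `OverflowError` instead of returning infinity, which makes the rewritten form equally necessary.

```python
import math


def log1pexp(z):
    """Evaluate log(1 + e^z) as log(1 + e^{-|z|}) + max(z, 0)."""
    # exp(-|z|) lies in (0, 1], so it can underflow to 0 but never overflow.
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)
```

Here `log1pexp(5000.0)` evaluates to `5000.0` and `log1pexp(0.0)` to $\log 2$.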

Now suppose that 1000 is a lower bound on $g(z)$. A numerical integration procedure would have to
deal with numbers on the order of $e^{-1000}$, leading to a zero estimate of $B$. To avoid this pitfall we
translate $g(z)$. A Golden Section Search can produce the minimum value of $g(z)$. Let $\hat{z}$ be
the scalar such that $g'(\hat{z}) = 0$, i.e., $\hat{z}$ is the global minimizer of $g$. We can now consider

\begin{align}
B = \int_{-\infty}^{+\infty}\exp[-g(z)]\,dz &= \int_{-\infty}^{+\infty}\exp[-g(z+\hat{z})]\,dz \tag{93}\\
&= \exp[-g(\hat{z})]\int_{-\infty}^{+\infty}\exp[-(g(z+\hat{z}) - g(\hat{z}))]\,dz \tag{94}\\
\log B &= -g(\hat{z}) + \log\left(\int_{-\infty}^{+\infty}\exp[-(g(z+\hat{z}) - g(\hat{z}))]\,dz\right). \tag{95}
\end{align}

The translated function $\tilde{g}(z) = g(z+\hat{z}) - g(\hat{z})$ is nonnegative with 0 as the global minimizer and
$\tilde{g}(0) = 0$.

Our last step before using numerical integration is the contraction of the integration limits. Using
the substitution $u = \tan^{-1} z$ (or $z = \tan u$) we arrive at

\begin{align}
B &= \exp[-g(\hat{z})]\int_{-\pi/2}^{+\pi/2}\exp[-\tilde{g}(\tan u)]\sec^2(u)\,du \tag{96}\\
\log B &= -g(\hat{z}) + \log\left(\int_{-\pi/2}^{+\pi/2}\exp[-\tilde{g}(\tan u)]\sec^2(u)\,du\right). \tag{97}
\end{align}

This final integral serves as the input to a numerical integrator to produce $\log B$.
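The three steps above (stable evaluation, translation by the minimizer, contraction of the limits) can be combined as in the following Python sketch. It is an illustration under assumptions, not the authors' implementation: the function names, the bracket-doubling step that precedes the golden section search, the fixed iteration count, and the choice of a midpoint rule with `n` nodes are all choices made here. It presumes slopes of both signs so that $g$ has a finite minimizer.

```python
import math


def log1pexp(t):
    """Stable log(1 + e^t), following Lemma S.5.1."""
    return math.log1p(math.exp(-abs(t))) + max(t, 0.0)


def g(z, slopes, knots, mults):
    """Negative log of the unnormalized versatile logistic density, as in (92)."""
    return sum(m * log1pexp(b * (z - c)) for b, c, m in zip(slopes, knots, mults))


def golden_section_min(f, a, b, iters=200):
    """Golden section search for the minimizer of a convex f on [a, b]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


def log_normalizer(slopes, knots, mults, n=20000):
    """Estimate log B via (97) using a midpoint rule on (-pi/2, pi/2)."""
    f = lambda z: g(z, slopes, knots, mults)
    center = sum(knots) / len(knots)
    width = 1.0
    # Widen the bracket until both ends are no lower than the center (g is convex).
    while f(center - width) < f(center) or f(center + width) < f(center):
        width *= 2.0
    z_hat = golden_section_min(f, center - width, center + width)
    g_min = f(z_hat)
    step = math.pi / n
    total = 0.0
    for i in range(n):
        u = -0.5 * math.pi + (i + 0.5) * step
        # Integrand of (97): exp(-g_tilde(tan u)) * sec^2(u)
        total += math.exp(-(f(math.tan(u) + z_hat) - g_min)) / math.cos(u) ** 2
    return -g_min + math.log(total * step)
```

For the two-component case the result can be compared against the closed form of S.1.1; for instance, with slopes $[+1, -1]$, zero knots and multiplicities $[2000, 2000]$, the value should be close to $2\log\Gamma(2000) - \log\Gamma(4000)$ even though $e^{-g(\hat{z})}$ itself underflows.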
S.6 Approximate mode simplification

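Recall: $y_n^2 = 1$, $y_n h(x_n) = +1 \Leftrightarrow y_n = h(x_n)$ and $y_n h(x_n) = -1 \Leftrightarrow y_n \neq h(x_n)$. We form

\[ Z \triangleq \sum_{n=1}^{N}\phi_n e^{-\tau y_n H(x_n)}, \qquad d_n \triangleq \frac{\phi_n e^{-\tau y_n H(x_n)}}{Z}. \tag{98} \]

We have

\begin{align}
\alpha &= \frac{1}{2\tau}\log\left(\frac{\mu_0 + \sum_{n=1}^{N}\phi_n e^{-\tau H(x_n)h(x_n)}\,\mathbb{1}\{y_n = h(x_n)\}}{\mu_0 + \sum_{n=1}^{N}\phi_n e^{+\tau H(x_n)h(x_n)}\,\mathbb{1}\{y_n \neq h(x_n)\}}\right) \tag{99}\\
&= \frac{1}{2\tau}\log\left(\frac{\mu_0 + \sum_{n=1}^{N}\phi_n e^{-\tau y_n H(x_n)\,y_n h(x_n)}\,\mathbb{1}\{y_n = h(x_n)\}}{\mu_0 + \sum_{n=1}^{N}\phi_n e^{+\tau y_n H(x_n)\,y_n h(x_n)}\,\mathbb{1}\{y_n \neq h(x_n)\}}\right) \tag{100}\\
&= \frac{1}{2\tau}\log\left(\frac{\mu_0 + \sum_{n=1}^{N}\phi_n e^{-\tau y_n H(x_n)}\,\mathbb{1}\{y_n = h(x_n)\}}{\mu_0 + \sum_{n=1}^{N}\phi_n e^{-\tau y_n H(x_n)}\,\mathbb{1}\{y_n \neq h(x_n)\}}\right) \tag{101}\\
&= \frac{1}{2\tau}\log\left(\frac{\mu_0/Z + \sum_{n=1}^{N}\phi_n e^{-\tau y_n H(x_n)}\,\mathbb{1}\{y_n = h(x_n)\}/Z}{\mu_0/Z + \sum_{n=1}^{N}\phi_n e^{-\tau y_n H(x_n)}\,\mathbb{1}\{y_n \neq h(x_n)\}/Z}\right) \tag{102}\\
&= \frac{1}{2\tau}\log\left(\frac{\mu_0/Z + \sum_{n=1}^{N}d_n\,\mathbb{1}\{y_n = h(x_n)\}}{\mu_0/Z + \sum_{n=1}^{N}d_n\,\mathbb{1}\{y_n \neq h(x_n)\}}\right) \tag{103}\\
&= \frac{1}{2\tau}\log\left(\frac{\mu_0/Z + \sum_{n=1}^{N}d_n\left(1 - \mathbb{1}\{y_n \neq h(x_n)\}\right)}{\mu_0/Z + \sum_{n=1}^{N}d_n\,\mathbb{1}\{y_n \neq h(x_n)\}}\right) \tag{104}\\
&= \frac{1}{2\tau}\log\left(\frac{\mu_0/Z + 1 - \sum_{n=1}^{N}d_n\,\mathbb{1}\{y_n \neq h(x_n)\}}{\mu_0/Z + \sum_{n=1}^{N}d_n\,\mathbb{1}\{y_n \neq h(x_n)\}}\right) \tag{105}\\
&= \frac{1}{2\tau}\log\left(\frac{1 - \epsilon + \mu_0/Z}{\epsilon + \mu_0/Z}\right), \tag{106}
\end{align}

where $\epsilon \triangleq \sum_{n=1}^{N}d_n\,\mathbb{1}\{y_n \neq h(x_n)\}$ is the weighted error of $h$ under the weights $d_n$.

S.7 Matlab code for Long-Servedio Data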



The code below can be used to generate samples. In the paper, we called the function with n=10
and eta=0.20.

function [X, y] = longservedio(m, n, eta)
%
% [X, y] = longservedio(m, n, eta)
%
% inputs:
%   m   -> # of samples
%   n   -> data will have 2n+11 dimensions
%   eta -> Bayes error
%
% outputs:
%   X -> matrix m-by-(2n+11)
%   y -> m-vector of +1/-1 labels
%
% Source:
%   Boosting: Foundations and Algorithms
%   Schapire & Freund, MIT Press 2012
%   [12.3]
%
% m-file author: Alex Lorbert
%

k = 2*n+1;
X = zeros(m, 2*n+11);
y = zeros(m, 1);

for i = 1:m
    t = rand(1);
    if t < 1/4          % with probability 0.25
        x1 = ones(1, 2*n+1);
        x2 = get_random_bv(10, 0);
    elseif t < 3/4      % with probability 0.5
        x1 = get_random_bv(k, 1);
        x2 = get_random_bv(10, -2);
    else                % with probability 0.25
        x1 = get_random_bv(k, 1);
        x2 = ones(1, 10);
    end
    X(i, :) = [x1 x2];
    s = rand(1);        % keep the label +1 with probability 1-eta
    if s < 1-eta
        y(i) = 1;
    else
        y(i) = -1;
    end
end

return

function x = get_random_bv(n, k)  % get random binary (+1/-1) vector of dimension n
                                  % with (k+n)/2 entries set to -1, so the sum equals -k
m = (k + n)/2;                    % n and k will have same parity by construction
x = ones(1, n);
x(1:m) = -1;
x = x(randperm(n));               % randomly permute
return